Abstract

This paper is concerned with learning binary classifiers under adversarial label-noise. We
introduce the problem of error-correction in learning where the goal is to recover the original
clean data from a label-manipulated version of it, given (i) no constraints on the adversary
other than an upper-bound on the number of errors, and (ii) some regularity properties
for the original data. We present a simple and practical error-correction algorithm called
SubSVMs that learns individual SVMs on several small-size (log-size), class-balanced, random
subsets of the data and then reclassifies the training points using a majority vote. Our
analysis reveals the need for the two main ingredients of SubSVMs, namely class-balanced
sampling and subsampled bagging. Experimental results on synthetic as well as benchmark
UCI data demonstrate the effectiveness of our approach. In addition to noise-tolerance,
log-size subsampled bagging also yields significant run-time benefits over standard SVMs.



1. Introduction 

Learning in the presence of noise is notoriously difficult; there are many negative results
for learning under noise Ben-David et al. (2003); Hastad (1997); Kearns et al. (1994); Long
and Servedio (2011), while positive results are mostly known only for the case of random
noise or under strong distributional assumptions Blum et al. (1996); Kalai et al. (2008);
Sastry et al. (2010); Servedio (2003). Somewhat more encouraging results exist in max-margin
settings Buja and Stuetzle (2000); Har-peled et al. (2006); Shalev-Shwartz et al. (2010);
Xu et al. (2006) but these methods are computationally prohibitive even for
reasonably-sized data.

In this paper, we investigate the learning of binary classifiers under adversarial (worst-case)
label-noise. We introduce the problem of error-correction in learning, as the task
of correcting the label-errors in training data, $D$, given that the original (clean) data, $\bar{D}$,
intrinsically satisfies some regularity properties. (Given negative results such as Guruswami
and Raghavendra (2006) regarding the hardness of learning better-than-random hyperplanes
even from nearly-separable data, some notion of regularity becomes essential). Informally,
$\bar{D}$ is said to be $r$-regular if SVMs trained on very small random $r$-subsets of $\bar{D}$ make less
than $\theta$-fraction errors over all of $\bar{D}$. We show that every linearly separable $\bar{D}$ exhibits some
regularity, and that such a $\bar{D}$ can be recovered from any $D$ with roughly
$\left(\tfrac{1}{2} - 2\theta - O(\log^2 r)\right)$-fraction of errors. The main idea in our analysis is to apply
margin-based generalization bounds under a chosen sampling distribution over $\bar{D}$ and to then
adjust the bounds for the noise in $D$. To the best of our knowledge, this is the first positive
result that is known about learning classifiers under adversarial label-errors.

Our algorithm for error-correction, called SubSVMs (Subsample bagging of SVMs) is as 
follows: Train SVMs on suitably-small, class-balanced, random subsets of D and reclassify 
every training point using a simple majority vote. We show that class-balanced sampling 
over D minimizes the worst-case probability of drawing less than any-chosen-number of 
clean points per class from D. The number of worst-case errors that each SVM in the 
ensemble makes can grow as the squared-log of the subsample-size used, and this leads us 
to the final error-correction performance of SubSVMs. 

In experimental work, we first study the error-correction achievable on synthetic linearly
separable data. By comparing against performance under uniform sampling (common
in standard bagging) we show that class-balanced sampling plays a vital role in
error-correction. Then we show that error-correction based on SubSVMs leads to better classifiers
which outperform regular SVMs on a range of benchmark data sets from the UCI Machine
Learning Repository. Our experiments also clearly demonstrate the superiority of SubSVMs
over regular bagging. We inject high levels of label-noise into the training data sets (the
number of errors was fixed at 75% of the size of the minority class). On previously unseen
(clean) test sets, SubSVMs even outperformed SVMs that directly used the full test sets for
cross-validation. Subsampling at logarithmic sizes also gives SubSVMs substantial run-time
advantages over standard SVMs and regular bagging.

Related Work: Several results show that learning under adversarial noise can be NP-hard
Hastad (1997); Kearns et al. (1994); Feldman et al. (2006); Guruswami and Raghavendra
(2006). Better results (polynomial-time algorithms) are known in the context of learning
max-margin classifiers from noisy data Har-peled et al. (2006); Shalev-Shwartz et al. (2010);
Xu et al. (2006). However, these techniques are computationally prohibitive in practice,
e.g., the method proposed in Xu et al. (2006) uses SDP solvers that can become impractical
even for a hundred training points. Many boosting algorithms, with convex potential
functions, have also been shown vulnerable to random classification noise Long and Servedio
(2010).

In statistical (rather than adversarial) settings, generalization results for SVMs demonstrate
efficient learnability when training and test points are drawn iid from the same
(even if noisy) distribution Cristianini and Shawe-Taylor (2000). Some works have focused
on the ineffectiveness of SVMs in the presence of outliers and for noisy class-imbalanced
data (e.g., see Akbani et al. (2004); Trafalis and Gilbert (2005); Nath and Bhattacharyya
(2007)), albeit without formal analysis. Recently, large-margin half-spaces were shown to
be efficiently learnable under small amounts of malicious noise Long and Servedio (2011).
Similarly, Dekel and Shamir (2009) demonstrate learning from multi-teacher data, where
a small number of teachers can replace randomly chosen examples arbitrarily. A general
framework for distribution-dependent learning in-the-limit was proposed in Caramanis and
Mannor (2008); the focus, however, was on establishing informational limits rather than
sample complexities. We consider learning under adversarial label-errors given that the
original data satisfies some regularity properties. Our error-model is relevant both when
the label-errors are inadvertent, whether systematic or random, and when errors are
introduced by an adversary explicitly trying to mislead the learning process.

Several studies investigated why (and under what conditions) bagging works by formalizing
different notions of stability for predictors and by showing that bagging reduces
the variance of unstable predictors (see, e.g., Breiman (1996); Buhlmann and Yu (2002);
Elisseeff et al. (2005); Grandvalet (2004)). Experimental bias-variance analysis of random
aggregation and bagging of SVMs demonstrated that working with small samples achieves
greater reduction in the variance component of error than standard bagging (see Valentini
(2004)). In another related work, Brodley and Friedl (1999) presented an experimental
study of various methods for identifying mislabeled data. All these studies, including the
ones that analyze bagging, restricted attention to distribution-based models, rather than
adversarial settings.



2. Error correction problem in learning 

Let $\bar{D} = \{(x_i, \bar{y}_i) : i = 1, \ldots, \ell\}$ be the set of examples in a binary classification problem;
the feature vectors, $x_i$, come from some domain $\mathcal{X}$ and the class-labels, $\bar{y}_i$, take values from
$\{-1, +1\}$. The proportion of minority class points in $\bar{D}$ is denoted $\beta$, $0 < \beta < 0.5$.

Let $\Psi_{\bar{D}}$ denote a binary SVM classifier trained on $\bar{D}$;$^{1}$ for $x \in \mathcal{X}$, the classifier returns
the label $\Psi_{\bar{D}}(x) \in \{-1, +1\}$. We assume that $\Psi_{\bar{D}}$ is suitable for the given classification
task. However, the clean data $\bar{D}$ is not available to train the learning algorithm. Instead, the learner only
has access to $D = \{(x_i, y_i) : i = 1, \ldots, \ell\}$, which is a label-manipulated version of $\bar{D}$.$^{2}$

The adversary is allowed to flip labels of no more than $\rho\beta\ell$ examples in $\bar{D}$, where $\rho$ is
referred to as the error parameter. Since we place no other restrictions on the points the
adversary can manipulate, we must have the constraint $0 < \rho < 1$ (otherwise, we may be
left with no training examples for one class).

The error-correction problem is concerned with recovering the original clean data $\bar{D}$
(or a close approximation of it) from its label-manipulated version $D$. To this end, we will
allow some 'regularity' assumptions on the original data $\bar{D}$, which essentially guarantee that
SVMs trained on sufficiently-small random subsets of $\bar{D}$ can classify the points in $\bar{D}$ with
high accuracy. Regularity is an intrinsic property of the original data, which can manifest
and be measured in many ways; one way is to measure the redundancy structure exposed
by the quadratic program underlying the max-margin formulation of SVMs.

Definition 1 (Data Regularity) Let $\mathcal{D}^*$ be any (discrete) probability distribution over the
clean data $\bar{D}$ and let $S \sim \mathcal{D}^*$, $|S| \ge r$, denote a collection of points drawn iid from $\mathcal{D}^*$.
For any $\delta < 0.5$ and $\theta < 0.5$, $\bar{D}$ is said to be $r$-regular at $(\delta, \theta)$ if with probability at
least $1 - \delta$ over choice of $S$, the expected error-rate of $\Psi_S$ does not exceed $\theta$ with respect
to test examples also drawn iid from $\mathcal{D}^*$.

1. $\Psi_S$ denotes the SVM trained on $S$, etc.

2. $D$ is also referred to as the corrupted or noisy data.

Srivatsan Laxman, Sushil Mittal and Ramarathnam Venkatesan

We are interested in regularity at small $r$, such as at $O(\log \ell)$ or $O(\log^2 \ell)$. Data regularity
can be thought of as a measure of redundancy needed to admit learning in the presence
of adversarial label-noise. This is, in a sense, akin to the redundancy encoded into a message
for enabling error-correction in coding theory. Regularity is a simple property that is
satisfied by data from which good binary classifiers can be easily learnt, e.g., every linearly
separable data set is regular.

Lemma 2 (Separability implies Regularity) Consider any linearly separable clean data $\bar{D}$ with
margin $\gamma$. For any fixed $\delta < 0.5$ and $\theta < 0.5$, there exists $r \in \mathbb{Z}^+$ such that $\bar{D}$ is $r$-regular
at $(\delta, \theta)$.

The proof makes use of the following 2-norm soft-margin bound from SVM generalization
theory Cristianini and Shawe-Taylor (2000):

Theorem 3 (Cristianini and Shawe-Taylor, 2000, Theorem 4.22) Consider thresholding
real-valued linear functions $\mathcal{L}$ with unit weight vectors on an inner product space $\mathcal{X}$ and fix
$\gamma \in \mathbb{R}^+$. There is a constant $c$, such that for any probability distribution $\mathcal{D}$ on $\mathcal{X} \times \{-1, +1\}$
with support in a ball of radius $R$ around the origin, with probability $1 - \delta$ over $\ell$ random
(training) examples $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$, any hypothesis $f \in \mathcal{L}$ has error no more
than

\[
\Pr_{(x,y) \sim \mathcal{D}}\left[f(x) \neq y\right] \;\le\; \frac{c}{\ell}\left(\frac{R^2 + \|\xi\|_2^2}{\gamma^2}\,\log^2 \ell + \log\frac{1}{\delta}\right) \tag{1}
\]

where $\xi = (\xi_1, \ldots, \xi_\ell)$ is the margin slack vector with respect to $f$ and $\gamma$. The entries of $\xi$
are fixed as follows: $\xi_i = \max(0, \gamma - y_i f(x_i))$, $i = 1, \ldots, \ell$.

Since the clean data $\bar{D}$ is separable with margin $\gamma$, every subset of $\bar{D}$ is also separable with
margin at least $\gamma$. Thus, the max-margin separator of every subset of $\bar{D}$ will have margin slack
vector $\xi = 0$ (with respect to the chosen subset). Fixing $\mathcal{D} = \mathcal{D}^*$ in Theorem 3, the generalization
error of $\Psi_S$ trained on any $S \sim \mathcal{D}^*$, $|S| = r$, is given by

\[
\Pr_{(x,y) \sim \mathcal{D}^*}\left[\Psi_S(x) \neq y\right] \;\le\; \frac{c}{r}\left(\frac{R^2}{\gamma^2}\,\log^2 r + \log\frac{1}{\delta}\right) \tag{2}
\]

Lemma 2 follows since the RHS of (2) is $O(\log^2 r / r)$.



Definition 4 (Error-correction in Learning) Given that the clean data $\bar{D}$ and the corrupted
data $D$ disagree on no more than $\rho\beta$-fraction of labels, and given that $\bar{D}$ satisfies some
regularity properties, the problem of error-correction in learning is to recover a data set $\hat{D}$
with as few label disagreements with $\bar{D}$ as possible.

We make no assumptions regarding the nature of label-errors (such as if they are statistical
or otherwise), or regarding the separate values of error-parameter ($\rho$) and true fraction of
minority-class ($\beta$); we are only given that the total fraction of label-errors does not exceed
$\rho\beta$, $0 < \rho < 1$ and $0 < \beta < 0.5$.






Error Correction in Learning using SVMs 



Algorithm 1 [SubSVMs] Subsampled bagging of SVMs

Input: Corrupted data $D = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$; size, $s$, of subsample; sampling bias $p$;
       number of SVMs $J$ (typically, $p = 1/2$ and $s = \log \ell$ or $s = \log^2 \ell$)
Output: Error-corrected data $\hat{D} = \{(x_1, \hat{y}_1), \ldots, (x_\ell, \hat{y}_\ell)\}$

/* Training */
for j = 1 to J do
    Draw random subset $S_j \sim \mathcal{D}^p_D$ of size $|S_j| = s$
    Train SVM $\Psi_{S_j}$

/* Error-correction */
for i = 1 to $\ell$ do
    Set $\hat{y}_i$ to the majority label in $\{\Psi_{S_1}(x_i), \ldots, \Psi_{S_J}(x_i)\}$

Output $\hat{D} = \{(x_1, \hat{y}_1), \ldots, (x_\ell, \hat{y}_\ell)\}$



3. The SubSVMs algorithm 

We first define a key ingredient of the SubSVMs algorithm that we refer to as $p$-biased
sampling.

Definition 5 ($p$-biased Sampling) The process of $p$-biased sampling of $D$ refers to the
following two steps, executed in the stated order: (1) choose the minority class$^{3}$ of $D$ with
probability $p$ (or the other class with probability $1 - p$) and (2) pick a point uniformly at
random from the restriction of $D$ to the chosen class. The corresponding sampling distribution
is denoted $\mathcal{D}^p_D$ and $S \sim \mathcal{D}^p_D$ denotes that $S$ is a random collection of points drawn
iid with respect to $\mathcal{D}^p_D$.

The case of $p = 0.5$ is referred to as class-balanced sampling of $D$; if $\beta$ denotes the
fraction of minority class points in $D$, the case of $p = \beta$ is equivalent to uniform sampling
over $D$.
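The two steps of Definition 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `p_biased_sample` is hypothetical, labels are taken to be +1/-1, both classes are assumed non-empty, and ties in class size are broken in favour of +1 (the paper allows an arbitrary choice).

```python
import random

def p_biased_sample(labels, p, s, rng):
    """Draw s indices iid (with replacement) by p-biased sampling over `labels`."""
    by_class = {+1: [], -1: []}
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # The smaller class is the minority class; a tie is resolved arbitrarily as +1.
    minority = +1 if len(by_class[+1]) <= len(by_class[-1]) else -1
    sample = []
    for _ in range(s):
        # Step (1): choose the minority class with probability p.
        cls = minority if rng.random() < p else -minority
        # Step (2): pick a point uniformly from the chosen class.
        sample.append(rng.choice(by_class[cls]))
    return sample

if __name__ == "__main__":
    rng = random.Random(0)
    labels = [+1] * 50 + [-1] * 950            # minority fraction 0.05
    for p in (0.5, 0.05):                      # class-balanced vs. uniform
        idx = p_biased_sample(labels, p, 5000, rng)
        frac = sum(labels[i] == +1 for i in idx) / len(idx)
        print(f"p = {p}: minority fraction in the sample = {frac:.3f}")
```

With p = 0.5 the sample is roughly half minority points regardless of the class imbalance, whereas p equal to the minority fraction reproduces uniform sampling.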

Algorithm 1 lists the pseudo-code for subsampled bagging of SVMs (SubSVMs). Our
analysis (in Secs. 3.1 and 3.2) reveals two important aspects of SubSVMs:

• Class-balanced sampling provides optimal protection against worst-case label-errors.

• The fraction of errors that can be tolerated ($\rho\beta$) reduces as the squared-log of
sample-size $s$.

Based on the above, we use class-balanced sampling ($p = 1/2$) and choose $s$ to be $\log \ell$ or
$\log^2 \ell$.
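As a rough illustration of the subsample-train-vote loop described above, the following pure-Python sketch uses a Pegasos-style subgradient solver for a linear SVM. The solver is a stand-in chosen to keep the example dependency-free; it is not the solver used in the paper. The function names, the regularization value `lam`, the number of epochs and the tie-breaking rules are all assumptions made for this sketch.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=100, rng=None):
    """Approximate linear SVM: Pegasos-style subgradient steps on the hinge loss."""
    rng = rng or random.Random(0)
    w = [0.0] * (len(X[0]) + 1)                 # last weight plays the role of a bias
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):    # one shuffled pass
            t += 1
            step = 1.0 / (lam * t)
            xi = list(X[i]) + [1.0]
            margin = y[i] * sum(a * b for a, b in zip(w, xi))
            w = [(1.0 - step * lam) * a for a in w]    # shrink (regularizer)
            if margin < 1.0:                           # hinge loss is active
                w = [a + step * y[i] * b for a, b in zip(w, xi)]
    return w

def predict(w, x):
    score = sum(a * b for a, b in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1

def subsvms(X, y, s, J, p=0.5, seed=0):
    """Relabel every point by a majority vote of J SVMs trained on p-biased subsamples."""
    rng = random.Random(seed)
    by_class = {1: [i for i, v in enumerate(y) if v == 1],
                -1: [i for i, v in enumerate(y) if v == -1]}
    minority = 1 if len(by_class[1]) <= len(by_class[-1]) else -1
    votes = [0] * len(X)
    for _ in range(J):
        # p-biased sampling: pick a class first, then a point within that class.
        idx = [rng.choice(by_class[minority if rng.random() < p else -minority])
               for _ in range(s)]
        w = train_linear_svm([X[i] for i in idx], [y[i] for i in idx], rng=rng)
        for i, x in enumerate(X):
            votes[i] += predict(w, x)
    return [1 if v >= 0 else -1 for v in votes]   # ties (even J) go to +1
```

Calling `subsvms(X, noisy_labels, s, J)` returns the relabelled training set; an odd `J` avoids ties in the vote. Each SVM sees only `s` points, which is where the run-time advantage of log-size subsampling comes from.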



3.1 Error correction analysis 

Our analysis uses the margin-based generalization bound for SVMs with respect to a sampling
distribution over the original (clean) data $\bar{D}$ and then adjusts the bound to accommodate
the number of label-errors in the corrupted training set $D$.

Consider the general case of Algorithm 1 where the random subsets $S_j$ are drawn iid
from $\mathcal{D}^p_D$. Let $\bar{D}$ be linearly separable with margin $\gamma$. Consider a set of points $S \sim \mathcal{D}^p_D$.
We now need to compute the expected error-rate of $\Psi_S$ with respect to test points drawn
uniformly from $\bar{D}$ (this is the main quantity of interest in the error-correction setting). For
this, we first compute the expected error-rate $\epsilon$ when the training and test cases are both
drawn iid from $\mathcal{D}^p_D$. This is done by using Theorem 3 (Cristianini and Shawe-Taylor, 2000,
Theorem 4.22) with $f = \Psi_S$ and $\mathcal{D} = \mathcal{D}^p_D$ (see next paragraph for details). The error-rate
can at most become $\epsilon / p^*$, where $p^* = \min\{p, 1 - p\}$, when considering test cases drawn
uniformly from $D$.$^{4}$ Finally, in any uniformly drawn sample from $D$, the expected fraction
of label disagreements with respect to the corresponding points in $\bar{D}$ is $\rho\beta$. Hence, the
desired expected error-rate of $\Psi_S$, where $S \sim \mathcal{D}^p_D$ and the test points are drawn uniformly
from $\bar{D}$, is given by $\epsilon / p^* + \rho\beta$.

3. If both classes of $D$ are of identical size, one of them is arbitrarily fixed as the 'minority class'.

We now return to the computation of error-rate $\epsilon$ when train and test points are both
drawn iid from $\mathcal{D}^p_D$. Whenever $S$ contains at least $r/2$ clean points per class, the SVM of
the corresponding $r$-size (clean) subset of $S$ would make no more than $(s - r)$ mistakes on
the rest of $S$. Each of these mistakes would be no farther than $2R$ from either supporting
hyperplane. Also, the margin of this SVM would be at least $\gamma$ (the max-margin achieved
on the whole of the clean data $\bar{D}$). The 2-norm SVM objective has the same form as the error-bound in
(1). Hence, we apply Theorem 3 with $\|\xi\|_2^2 = 4R^2(s - r)$ and with margin $\gamma$, to obtain the
generalization bound, $\epsilon$. If $\eta$ is an upper-bound on the probability that $S$ contains less than
$r/2$ clean points from either class, then with probability at least $(1 - \eta - \delta)$

\[
\Pr_{(x,y) \sim \mathcal{D}^p_D}\left[\Psi_S(x) \neq y\right] \;\le\; \frac{c}{s}\left(\frac{R^2 + 4R^2(s - r)}{\gamma^2}\,\log^2 s + \log\frac{1}{\delta}\right) \;\stackrel{\mathrm{def}}{=}\; \epsilon .
\]



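Recall that this error-rate, $\epsilon$, over test points drawn from $\mathcal{D}^p_D$, translates to an error-rate of
$\epsilon / p^* + \rho\beta$ for test points drawn uniformly over the clean data $\bar{D}$. Thus, the final expression
for probability of error of $\Psi_S$ with respect to test points drawn uniformly from $\bar{D}$, denoted $\varphi$,
can be written as follows:

\[
\Pr\left[\Psi_S(x) \neq y\right] \;\le\; (1 - \eta - \delta)\left(\frac{\epsilon}{p^*} + \rho\beta\right) + \eta + \delta \;\stackrel{\mathrm{def}}{=}\; \varphi .
\]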



We use $J$ SVMs based on $J$ random sets such as $S$. Thus, if $\varphi < 0.5$, then (by Hoeffding
Inequality Hoeffding (1963)) the probability of a majority vote making a mistake with
respect to the clean data $\bar{D}$ cannot exceed $\exp[-2J(0.5 - \varphi)^2]$. This gives us error-correction (in the
sense that $\bar{D}$ can be correctly recovered from $D$). To enforce the condition $\varphi < 0.5$, we must have
$\rho\beta < 1 - \epsilon / p^* - [2(1 - \eta - \delta)]^{-1}$. Finally, if $\bar{D}$ is $r$-regular at $(\delta, \theta)$, then we have

\[
\epsilon \;=\; \frac{c}{s}\left(\frac{R^2}{\gamma^2}\,\log^2 s + \log\frac{1}{\delta}\right) + \frac{c}{s}\left(\frac{4R^2(s - r)}{\gamma^2}\,\log^2 s\right) \;<\; \theta + \frac{c\,(s - r)}{s}\,\frac{4R^2}{\gamma^2}\,\log^2 s \tag{3}
\]

This leads to our main result about SubSVMs:

4. See Appendix A for a short proof.
Figure 1: The data corruption process. Let $A$ be the minority class with $\beta$-fraction of points
in the clean data $\bar{D}$. $A_b$ represents the $\alpha$-fraction of corrupted points, originally in class-$B$, but
wrongly assigned to class-$A$ in the corrupted data $D$; similarly $B_b$ represents the class-$A$ points in
$\bar{D}$ that were mislabeled as class-$B$ in $D$. A total of $\rho\beta$-fraction of points are
corrupted in $D$.



Theorem 6 (Error-correction) Consider linearly separable clean data $\bar{D}$ with margin $\gamma$ and
$\beta$-fraction of minority-class points. Fix $\delta < 0.5$ and let $\bar{D}$ be $r$-regular at $(\delta, \theta)$. Consider
corrupted data $D$ with error-rate $\rho$ and $S \sim \mathcal{D}^p_D$, $|S| = s$, with
$\Pr[S \text{ contains} < r/2 \text{ clean points per class}] \le \eta$. If the number of label-errors in $D$ is bounded by

\[
\rho\beta \;<\; 1 - 2\theta - \left[\frac{1}{2(1 - \eta - \delta)} + \frac{4R^2 c\,(s - r)}{\gamma^2 s}\,\log^2 s\right] \tag{4}
\]

where $R$ denotes the radius of the ball enclosing the data and $c$ is the constant from
Theorem 3, then the probability of error for SubSVMs with respect to points drawn uniformly
from $\bar{D}$ is at most $\exp[-2J(0.5 - \varphi)^2]$, where $\varphi = \eta + \delta + (1 - \eta - \delta)\,[\epsilon / p^* + \rho\beta]$ and
$p^* = \min\{p, 1 - p\}$.

Hence, perfect error-correction is attained for $\varphi < 0.5$.
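To see how the quantities in the analysis interact, the sketch below evaluates the bound on the per-classifier error-rate, the combined error probability, the largest noise fraction compatible with that probability staying below 0.5, and the Hoeffding bound on the majority vote. It is an illustration only: the function names are hypothetical, the constant c of Theorem 3 is not specified in the text so any value passed in is an assumption, and logarithms are taken to be natural.

```python
import math

def epsilon_bound(c, R, gamma, s, r, delta):
    """Error bound when train and test points share the p-biased distribution."""
    slack = 4.0 * R**2 * (s - r)          # squared 2-norm of the margin slack vector
    return (c / s) * ((R**2 + slack) / gamma**2 * math.log(s)**2
                      + math.log(1.0 / delta))

def phi_bound(eps, p, rho_beta, eta, delta):
    """Error probability of one SVM w.r.t. points drawn uniformly from the clean data."""
    p_star = min(p, 1.0 - p)
    return eta + delta + (1.0 - eta - delta) * (eps / p_star + rho_beta)

def max_tolerable_noise(eps, p, eta, delta):
    """Largest rho*beta for which phi stays below 0.5."""
    p_star = min(p, 1.0 - p)
    return 1.0 - eps / p_star - 1.0 / (2.0 * (1.0 - eta - delta))

def vote_error_bound(phi, J):
    """Hoeffding bound on the majority vote of J classifiers, valid for phi < 0.5."""
    if phi >= 0.5:
        raise ValueError("the Hoeffding argument needs phi < 0.5")
    return math.exp(-2.0 * J * (0.5 - phi)**2)

if __name__ == "__main__":
    # Illustrative values only; c = 0.001 is an assumed constant.
    eps = epsilon_bound(c=0.001, R=1.0, gamma=1.0, s=64, r=32, delta=0.05)
    limit = max_tolerable_noise(eps, p=0.5, eta=0.05, delta=0.05)
    phi = phi_bound(eps, p=0.5, rho_beta=0.2, eta=0.05, delta=0.05)
    print(eps, limit, phi, vote_error_bound(phi, J=1000))
```

Setting the noise fraction exactly to the value returned by `max_tolerable_noise` gives a combined error probability of exactly 0.5, which matches the condition derived in the text.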



3.2 Importance of Class-balanced Sampling 

The bound in (4) has two groups of parameters. In the first group, we have $r$, $\delta$ and $\theta$,
which are fixed by the regularity properties of the clean data $\bar{D}$. In the second group, we have $s$ and $\eta$,
which are both determined by our sampling strategy. Since $\eta$ depends on the sampling bias
$p$, we now discuss how to fix $p$ and $s$ for optimal error-correction performance.

From (4) it is clear that, to maximize the number of errors that can be tolerated, we must
minimize the quantity in square brackets. The first term inside the brackets is minimized
when $\eta$ is minimum. Fig. 1 provides a graphical depiction of the data corruption process.
The optimal value of $\eta$ typically depends on the direction-of-attack parameter, $\alpha$, the error
parameter $\rho$, and the true size, $\beta$, of the minority class in $\bar{D}$. However, none of these is
known to the learner; only an upper-bound on the fraction of label-errors in $D$ is known.
So we design our algorithm to limit the impact of worst-case label-errors. Specifically, we
choose $p = 0.5$ since it minimizes $\eta$ in a manner that is agnostic to the true values of $\alpha$, $\rho$
and $\beta$. We state this formally in Lemma 7 below.

Lemma 7 (Class-balanced Sampling) Fix any $r \in \mathbb{Z}^+$. Given clean data $\bar{D}$ with $\beta$-fraction of
minority-class points and corrupted data $D$ with at most $\rho\beta$-fraction label-errors w.r.t. $\bar{D}$,
class-balanced sampling of $D$ minimizes a worst-case upper-bound on $\eta$ (probability that the sample drawn
contains less than $r/2$ clean points per class) if the size, $s$ ($> r$), of the sample satisfies

\[
s \;\ge\; 2r + 4\left(r \log 2 + \log^2 2 - \log 4\right)^{1/2} + \log 16 - 4 \tag{5}
\]

The main intuition behind the proof is that, in the absence of any specific information
regarding $\rho$, $\beta$ and $\alpha$, choosing the sampling bias $p$ on either side of 0.5 is vulnerable to one
of the attack directions, thereby increasing the worst-case value of $\eta$. (See Appendix B for
the proof).

The second term inside the square brackets of (4) is smallest (and equal to zero) for
$s = r$. However, Lemma 7 shows that this is not optimal for $\eta$, since $s = r$ fails the condition
in (5). In fact, for smaller $s$, $\eta$ may even be maximized at $p = 0.5$; in general, the minimizer
of $\eta$ will no longer be agnostic to $\rho$, $\beta$ and $\alpha$. However, when $s$ is set to the lower-bound
of (5), the second term inside square brackets of (4) becomes $O(\log^2 r)$. This gives us our
next lemma.

Lemma 8 (Subsampled Bagging) Let the clean data $\bar{D}$ be linearly separable and $r$-regular at
$(\delta, \theta)$ and let the corrupted data $D$ contain at most $(\rho\beta)$-fraction of adversarial label-errors.
SubSVMs based on class-balanced sampling and with subsample-size, $s$, set to the lower-bound in (5)
can perfectly recover the original $\bar{D}$, provided the fraction of label-errors in $D$ is bounded above
as follows:

\[
\rho\beta \;<\; 1 - 2\theta - \left[\frac{1}{2(1 - \eta - \delta)} + O(\log^2 r)\right] \tag{6}
\]



Since the above lemma requires $s$ to be set at the lower-bound of (5), it might appear
that we are operating on a knife-edge for choosing the subsample size. Luckily, this is not
the case, because if the data is regular at $r$, it would also be regular with the same $(\delta, \theta)$
for every $r' > r$. Hence, we could set $s$ to the lower bound in (5) corresponding to $r'$ and the above
Lemma would still hold, though with $O(\log^2 r')$ rather than $O(\log^2 r)$ inside the square
brackets. As a result, the number of worst-case errors allowed reduces for $r' > r$ and this
is the reason why we use subsampled bagging. Typically, we choose $s$ to be $\log \ell$ or $\log^2 \ell$
(rather than $\ell$, which is the usual case in bagging). As long as the data is $r$-regular for some
$r < s$ that satisfies (5), SubSVMs will give us error-correction. As a side-benefit, subsampling
at logarithmic sizes will give us dramatic run-time advantages over regular SVMs. Our
experimental results clearly demonstrate this aspect of SubSVMs.

4. Experiments 

We present experimental results of SubSVMs on simulated, linearly separable data as well
as LIBSVM extracts of some UCI data sets.$^{5}$ SVMs are known to perform well on these data
sets, so they can play the role of clean data in our experiments.

5. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Figure 2: Importance of class-balanced sampling ($p = 1/2$) in Algorithm 1. (a) Worst-case
and (b) average-case accuracy obtained using class-balanced sampling ($p = 1/2$)
over different label-manipulated data sets as a function of $J$. (c) Worst-case
and (d) average-case accuracy obtained using SubSVMs with uniform sampling
($p = \beta$).

Our data corruption process follows Fig. 1. Given 'clean' training data of size $\ell$
with minority class of size $\beta\ell$, $0 < \beta < 0.5$, the parameters $\rho$ and $\alpha$ control the corruption.
We randomly pick $\rho\beta\ell$ points for corruption, of which $\alpha$-fraction are picked uniformly at
random from the minority class and $(1 - \alpha)$-fraction from the other. By varying the attack
direction $\alpha$, we generated a wide range of corrupted data with different degrees of difficulty
for binary classification.
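The corruption step just described can be sketched as follows. The function name `corrupt_labels` is hypothetical, labels are assumed to be +1/-1, and the rounding of the two counts to whole points is an assumption of this sketch, since the text does not say how fractional counts are handled.

```python
import random

def corrupt_labels(labels, rho, alpha, rng):
    """Flip rho * (minority size) labels; an alpha-fraction of the flips hits the minority class."""
    pos = [i for i, y in enumerate(labels) if y == +1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    n_flip = int(rho * len(minority))          # rho * beta * ell points in total
    n_min = round(alpha * n_flip)              # flips taken from the minority class
    n_maj = n_flip - n_min                     # remaining flips from the other class
    flipped = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    noisy = list(labels)
    for i in flipped:
        noisy[i] = -noisy[i]
    return noisy, flipped

if __name__ == "__main__":
    rng = random.Random(0)
    clean = [+1] * 100 + [-1] * 300
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):  # attack directions used in the experiments
        noisy, flipped = corrupt_labels(clean, rho=0.75, alpha=alpha, rng=rng)
        print(alpha, len(flipped), sum(clean[i] == +1 for i in flipped))
```

Because the total number of flips never exceeds the minority-class size when the error parameter is below 1, both sampling calls always have enough points to draw from.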

4.1 Synthetic Data Experiments 

In the first experiment, we generated 'clean' data sets comprising 1000 $d$-dimensional
data points from a mixture of two Gaussian distributions, each with a covariance of $0.1\,I_d$
and a distance of two units between means. Three values of $d$ were used: 2, 16 and 30. A
constant margin of 0.2 units was enforced and misclassified points were manually removed.
The value of $\beta$ was varied between [0.05, 0.5] in steps of 0.05, $\rho = 0.75$ and $\alpha$ was varied
between [0.0, 1.0] in steps of 0.25.

We studied the importance of class-balanced sampling in Algorithm 1 (SubSVMs) by
comparing two versions of it: one with class-balanced sampling ($p = 1/2$) and the other
with uniform sampling ($p = \beta$). For every $d$, the data corresponding to each $[\beta, \alpha]$ pair
was subjected to 10 random corruptions. Figs. 2a and 2b summarize the results for
class-balanced sampling and Figs. 2c and 2d for uniform sampling. As expected, based on
Theorem 6, the number of mistakes made decays exponentially with increasing $J$.
Near-perfect error-correction is achieved using $p = 1/2$ for $J$ as small as $2^7$. For $p = \beta$, the
worst-case and average-case performances are worse by about 60% and 20%, respectively.
This experimentally validates Lemma 7 for using class-balanced sampling in SubSVMs.

                         Training set                 Test set
Data set    Feature      Total   Size of              Total   Size of
            dimension    size    minority class       size    minority class
a1a         123          1605    395 (25%)            30956   7446 (24%)
a2a         123          2265    572 (25%)            30296   7269 (24%)
a3a         123          3185    773 (24%)            29376   7068 (24%)
a4a         123          4781    1188 (25%)           27780   6653 (24%)
a5a         123          6414    1569 (25%)           26147   6272 (24%)
splice      60           1000    483 (48%)            2175    1044 (48%)
mushrooms   112          6093    2937 (48%)           2031    979 (48%)
svmguide1   4            3089    1089 (35%)           4000    2000 (50%)
w1a         300          2477    72 (3%)              47272   1407 (3%)
w2a         300          3470    107 (3%)             46279   1372 (3%)
w3a         300          4912    143 (3%)             44837   1336 (3%)
w4a         300          7366    216 (3%)             42383   1263 (3%)
w5a         300          9888    281 (3%)             39861   1198 (3%)

Table 1: LIBSVM UCI data extracts and their characteristics.

4.2 UCI Data Experiments 

We now report the performance of SubSVMs on held-out test data using the LIBSVM UCI
extracts. There are two ways to test this: either the error-corrected training data can be
used to retrain a fresh standard SVM, or we can just use majority voting over the $J$ SVMs
already trained in SubSVMs. In our experiments, both these approaches yielded very similar
results. Therefore, we avoid the retraining cost and report results using the majority voting
method.

Table 1 shows the data characteristics of the 13 data sets used. The fraction of the
minority class, $\beta$, ranges from 0.03 to 0.48 in training sets and from 0.03 to 0.50 in test sets.
Also, the feature dimension varies between 4 and 300. Note that although these data sets
are not linearly separable, they are still referred to as 'clean' before they are subjected to
label-manipulation. For generating different types of attacks, $\rho = 0.75$ was used while the
value of $\alpha$ was varied between [0.0, 1.0] in steps of 0.25. We compare SubSVMs against
four other SVM-based classifiers:

1. Oracle-SVM: Standard SVM learnt over training data with parameters fixed by
cross-validating directly over the clean test set.

2. Blind-SVM: Standard SVM learnt over training data with parameters fixed based on
the best average performance over all test sets. This is similar to Oracle-SVM, except
that a single set of parameters is used for all data sets. This helps assess the feasibility
of blindly fixing the same set of parameters for all test sets.

3. Bag-SVM: Regular bagging of SVMs where each SVM in the ensemble is trained on a
bootstrap sample of the same size as the original data (sampled with replacement). All
SVMs use the same set of optimum parameters, which were determined through test
set cross-validation of Oracle-SVM.

4. CV-SVM: Standard SVM with parameters chosen through four-fold cross-validation on
the training data. In all the experiments, the results of CV-SVM are averaged over five
different random splits of the training data for cross-validation.

All cross-validations were performed by varying the penalty parameter $C$ between 1 and
100, the ratio of the weights of the two classes $w$ between 0.1 and 10 and the RBF kernel
parameter $\sigma^2$ between $0.1/d$ and $10/d$, where $d$ is the data dimensionality. For SubSVMs,
the values of $C = 100$, $w = 1$, $\sigma^2 = 1/d$, $s = \log^2 \ell$ and $J = 1000$ were fixed for all data sets
without performing any sort of cross-validation. All the SVMs were trained under L-2 loss,
although similar results were also obtained under L-1 loss.

Note that Oracle-SVM, Blind-SVM and Bag-SVM use information about test set labels 
to obtain their corresponding set of optimum parameters for training. This gives them an 
unfair advantage over CV-SVM and SubSVMs that are both agnostic to test set labels. 

Performance measure: The UCI data sets exhibit a wide range of class imbalance:
a1a-a5a are moderately imbalanced, splice, mushrooms and svmguide1 are class-balanced
while w1a-w5a are highly imbalanced. For imbalanced data, high accuracies can be trivially
achieved by labeling all points with the majority class label. Since accuracy is ineffective in
such settings, we use its skew-insensitive version called Balanced Accuracy$^{6}$ (BAC)
Brodersen et al. (2010). Note that for class-balanced data, BAC reduces to accuracy.
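For concreteness, balanced accuracy can be computed as the mean of the per-class recalls, which is the standard definition of the measure. The sketch below is illustrative: the function names are hypothetical, labels are assumed to be +1/-1, and both classes are assumed to be present in the true labels.

```python
def accuracy(y_true, y_pred):
    return sum(t == q for t, q in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and the recall on the negative class."""
    pos = sum(1 for t in y_true if t == +1)
    neg = sum(1 for t in y_true if t == -1)
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == +1 and q == +1)
    tn = sum(1 for t, q in zip(y_true, y_pred) if t == -1 and q == -1)
    return 0.5 * (tp / pos + tn / neg)

if __name__ == "__main__":
    # A trivial majority-class predictor on data with 3% minority points.
    y_true = [+1] * 3 + [-1] * 97
    y_pred = [-1] * 100
    print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

On the skewed example the trivial predictor reaches high accuracy but only chance-level balanced accuracy, which is the behaviour that motivates the measure.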



Table 2 summarizes the results of all the five methods on clean as well as corrupted
versions of the data. For every data set, 10 random corruptions were performed w.r.t. the
corresponding attack direction $\alpha$ and the averaged results are reported. Winning results,
when significantly better than the rest, are highlighted.$^{7}$

• SubSVMs is almost always significantly better than all the other methods (by 5%
or more) and is never significantly worse. The advantage of SubSVMs is visible in
both balanced and imbalanced data; for imbalanced data, the advantage increases for
smaller $\alpha$. This is because the quality of minority-class data falls sharply with $\alpha$.

• Oracle-SVM is at least as good as Blind-SVM. This is because Oracle-SVM tunes
parameters individually for each test set, while Blind-SVM fixes the same parameters
across all test sets.

• Oracle-SVM, Blind-SVM and Bag-SVM are better than CV-SVM. This is because all
three methods cross-validate directly on the test sets.

6. See Appendix C for details of this measure.

7. Std. devs. were negligible (mostly < 0.02, max 0.06).
Table 2: Balanced Accuracy (BAC) under L-2 loss for clean and noisy versions of UCI data
sets. For different types of attacks (different $\alpha$) the results for each method are
averaged over 10 different noisy versions. 'splc', 'mush' and 'svm1' stand for splice,
mushrooms and svmguide1. Only CV-SVM and SubSVMs are agnostic to the true
test labels. The cases where one of the methods is significantly better than all
others (> 0.05) are highlighted.



• Bag-SVM's performance is similar to that of Oracle-SVM. This is consistent with
Valentini (2004) that also reported no benefit in bagging SVMs (since SVMs are stable
classifiers).

• CV-SVM is the worst performing method and is often significantly worse than the others.
This shows its ineffectiveness under noisy settings.

Similar results were also obtained using Skew-Insensitive F-score (SIF) Flach (2003).
Results using Area Under the Curve (AUC) and accuracy, their unsuitability for imbalanced
data notwithstanding, are reported in Appendix D.



8. The case of clean, balanced data is the only exception.

Table 3: Training times in seconds (rounded to the closest integer) for all the methods
trained using L-2 loss, averaged over different corruptions, corresponding to the
results presented in Table 2. Note that for SubSVMs, the reported time is the time
taken to train all the J = 1000 SVMs on $s = \log_2 \ell$-size subsets.



Run-times: Table 3 summarizes training times averaged over different types of attacks.
SubSVMs is clearly much faster than all other methods. While our experiments were based
on single-core implementations, SubSVMs can be easily parallelized to handle very large-scale
problems.
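The reason SubSVMs parallelizes so easily is that each of the J models is trained on its own small, class-balanced subset, independently of the others. The following is a minimal, single-core sketch of that structure, not the implementation used in the experiments. It assumes labels in {-1, +1} and swaps in a simple Pegasos-style stochastic subgradient solver for the linear SVM (with a regularized bias term); the function names, the default of 51 models, and the solver settings are all hypothetical choices made for illustration.

```python
import math
import random


def train_linear_svm(X, y, lam=0.01, epochs=50, rng=None):
    """Pegasos-style stochastic subgradient descent for a linear SVM (assumed solver)."""
    rng = rng or random.Random(0)
    w = [0.0] * (len(X[0]) + 1)            # last entry acts as the bias term
    idx = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            xi = list(X[i]) + [1.0]
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:               # hinge loss is active for this point
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w


def predict(w, x):
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


def subsvms_correct(X, y, n_models=51, seed=0):
    """Relabel every point by majority vote over SVMs trained on log-size, class-balanced subsets."""
    rng = random.Random(seed)
    n = len(X)
    per_class = max(1, math.ceil(math.log2(n)) // 2)   # about log2(n) points per subset
    pos = [i for i in range(n) if y[i] == 1]
    neg = [i for i in range(n) if y[i] == -1]
    votes = [0] * n
    for _ in range(n_models):              # each iteration is independent of the others
        subset = (rng.sample(pos, min(per_class, len(pos)))
                  + rng.sample(neg, min(per_class, len(neg))))
        w = train_linear_svm([X[i] for i in subset], [y[i] for i in subset], rng=rng)
        for i in range(n):
            votes[i] += predict(w, X[i])
    return [1 if v >= 0 else -1 for v in votes]
```

Because the loop body over models shares no state other than the vote counters, the iterations could be distributed across workers and the votes summed afterwards. Using an odd number of models avoids ties in the vote.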

5. Conclusions 

We present a simple algorithm (SubSVMs) for learning binary classifiers under adversarial 
label-noise. SubSVMs can efficiently correct a bounded number of adversarial label-errors 
introduced in linearly separable data. Extensions to handle attribute noise and multi-class 
settings are important directions for future work. It would also be interesting to explore
the applicability of SubSVMs to large, noisy, real-world problems, where SVMs typically
perform poorly.



9. See Appendix D for more detailed run-times.

Appendix A. Error-rate of the classifier w.r.t. samples drawn uniformly from $\mathcal{D}$

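Let $\epsilon_1$ and $\epsilon_2$ be the class-conditional error rates for the two classes. Without loss of
generality, let $\epsilon_2 \ge \epsilon_1$. In the absence of knowledge of whether $\epsilon_2$ is associated with the
minority class or the majority class, the overall error rate w.r.t. samples drawn i.i.d.
from a distribution in which the two classes occur with probabilities $p$ and $1 - p$ is given by

$$\epsilon = \max\big(p\,\epsilon_1 + (1 - p)\,\epsilon_2,\; (1 - p)\,\epsilon_1 + p\,\epsilon_2\big) \le \epsilon_2. \tag{7}$$

Therefore, if $\epsilon = p\,\epsilon_1 + (1 - p)\,\epsilon_2$, then

$$\epsilon_2 = \frac{\epsilon - p\,\epsilon_1}{1 - p} \le \frac{\epsilon}{1 - p}, \tag{8}$$

and if $\epsilon = (1 - p)\,\epsilon_1 + p\,\epsilon_2$, then

$$\epsilon_2 = \frac{\epsilon - (1 - p)\,\epsilon_1}{p} \le \frac{\epsilon}{p}. \tag{9}$$

Therefore,

$$\epsilon_2 \le \max\left(\frac{\epsilon}{1 - p},\; \frac{\epsilon}{p}\right) = \frac{\epsilon}{p^*}, \tag{10}$$

where $p^* = \min\{p, 1 - p\}$.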
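The bound in (10) is easy to check numerically. The following is a minimal Python sketch, not part of the original analysis; the function names are hypothetical, and it assumes $0 < p < 1$ and that the second argument is the larger of the two class-conditional error rates.

```python
def overall_error(e1, e2, p):
    """Overall error rate as in (7): the larger of the two possible class mixtures."""
    return max(p * e1 + (1 - p) * e2, (1 - p) * e1 + p * e2)


def worst_class_error_bound(eps, p):
    """Upper bound (10) on the larger class-conditional error: eps / min(p, 1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return eps / min(p, 1.0 - p)


# Example: class-conditional errors of 2% and 10%, with p = 0.2.
e1, e2, p = 0.02, 0.10, 0.2
eps = overall_error(e1, e2, p)
bound = worst_class_error_bound(eps, p)   # never smaller than e2
```

The bound is loosest when $p$ is far from 0.5, since dividing by $p^*$ then inflates the overall error the most.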



Appendix B. Proof for optimality of class-balanced sampling (p = 0.5) 

Consider a two-class classification problem where the two classes are represented by $A$
and $B$. Without loss of generality, let $A$ be the minority class containing a fraction $0 < \beta \le 0.5$
of the points. Let $\tilde{A}$ and $\tilde{B}$ represent the two classes after one or both of the classes
are corrupted with adversarial noise. Let $\rho\beta$, $0 < \rho < 1$, represent the upper limit on the
fraction of corrupted points. Therefore, the total number of corrupted points can be written
as $n_c = \rho\beta\ell$. Further, let $\alpha$ be the fraction of the corrupted points that were originally in
class $B$ but were assigned to class $\tilde{A}$. Therefore, the fractions of the new classes are given
by

$$|\tilde{A}| = \beta + \alpha\rho\beta - (1 - \alpha)\rho\beta \tag{11}$$

$$|\tilde{B}| = 1 - \beta - \alpha\rho\beta + (1 - \alpha)\rho\beta \tag{12}$$

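Moreover, the fractions of good (clean) and bad (mislabeled) points in both the classes are

$$|\tilde{A}_g| = \beta - (1 - \alpha)\rho\beta \tag{13}$$

$$|\tilde{A}_b| = \alpha\rho\beta \tag{14}$$

$$|\tilde{B}_g| = 1 - \beta - \alpha\rho\beta \tag{15}$$

$$|\tilde{B}_b| = (1 - \alpha)\rho\beta \tag{16}$$

Therefore, the conditional probabilities of picking a good or a bad point for both the classes
are given by

$$P(a_g|\tilde{A}) = \frac{|\tilde{A}_g|}{|\tilde{A}|} = \frac{1 - (1 - \alpha)\rho}{1 + \alpha\rho - (1 - \alpha)\rho}, \tag{17}$$

$$P(a_b|\tilde{A}) = \frac{|\tilde{A}_b|}{|\tilde{A}|} = \frac{\alpha\rho}{1 + \alpha\rho - (1 - \alpha)\rho}, \tag{18}$$

$$P(b_g|\tilde{B}) = \frac{|\tilde{B}_g|}{|\tilde{B}|} = \frac{1 - \beta - \alpha\rho\beta}{1 - \beta - \alpha\rho\beta + (1 - \alpha)\rho\beta}, \tag{19}$$

$$P(b_b|\tilde{B}) = \frac{|\tilde{B}_b|}{|\tilde{B}|} = \frac{(1 - \alpha)\rho\beta}{1 - \beta - \alpha\rho\beta + (1 - \alpha)\rho\beta}. \tag{20}$$

Assuming that the probabilities with which points from classes $\tilde{A}$ and $\tilde{B}$ are picked are given
by $P(\tilde{A}) = p$ and $P(\tilde{B}) = 1 - p$ respectively, the probabilities of picking a good or a bad
point from each class are respectively given by $P(a_g) = P(a_g|\tilde{A})\,p$, $P(a_b) = P(a_b|\tilde{A})\,p$,
$P(b_g) = P(b_g|\tilde{B})(1 - p)$ and $P(b_b) = P(b_b|\tilde{B})(1 - p)$.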
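The bookkeeping in (11) to (20) can be reproduced in a few lines. The following is an illustrative Python sketch with hypothetical function names; it assumes the parameterization above (minority fraction `beta`, corruption budget `rho * beta`, and `alpha` the share of corrupted points moved from class B into class A).

```python
def class_fractions(beta, rho, alpha):
    """Fractions of clean and mislabeled points in the corrupted classes, as in (13)-(16)."""
    into_a = alpha * rho * beta          # originally B, relabeled as A
    into_b = (1 - alpha) * rho * beta    # originally A, relabeled as B
    return {
        "A_good": beta - into_b,
        "A_bad": into_a,
        "B_good": 1 - beta - into_a,
        "B_bad": into_b,
    }


def clean_given_class(beta, rho, alpha):
    """Chance that a point drawn from corrupted class A (resp. B) is clean, as in (17) and (19)."""
    f = class_fractions(beta, rho, alpha)
    p_a = f["A_good"] / (f["A_good"] + f["A_bad"])
    p_b = f["B_good"] / (f["B_good"] + f["B_bad"])
    return p_a, p_b
```

For instance, with `alpha = 1.0` every corrupted point lands in class A, and the clean share of that class drops to `1 / (1 + rho)` regardless of `beta`.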

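The probability $\eta$ of not picking $r/2$ clean points from either class is upper bounded by

$$\eta \le \sum_{k=0}^{r/2-1} \binom{s}{k} P(a_g)^k \big(1 - P(a_g)\big)^{s-k} + \sum_{k=0}^{r/2-1} \binom{s}{k} P(b_g)^k \big(1 - P(b_g)\big)^{s-k}. \tag{21}$$

For worst case analysis, we need to maximize $\eta$ and therefore minimize both $P(a_g)$ and
$P(b_g)$, which in turn requires minimizing $P(a_g|\tilde{A})$ and $P(b_g|\tilde{B})$ w.r.t. both $\alpha$ and $\beta$.
Differentiating $P(a_g|\tilde{A})$ w.r.t. $\alpha$,

$$\frac{\partial P(a_g|\tilde{A})}{\partial\alpha} = \frac{-\rho(1 - \rho)}{(1 - \rho + 2\alpha\rho)^2} < 0. \tag{22}$$

Therefore,

$$\arg\min_{\alpha} P(a_g) = 1. \tag{23}$$

Similarly, differentiating $P(b_g|\tilde{B})$ w.r.t. $\alpha$,

$$\frac{\partial P(b_g|\tilde{B})}{\partial\alpha} = \frac{\rho\beta\big(1 - \beta(1 + \rho)\big)}{(1 - \beta + \rho\beta - 2\alpha\rho\beta)^2} > 0. \tag{24}$$

Therefore,

$$\arg\min_{\alpha} P(b_g) = 0. \tag{25}$$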

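Also,

$$\frac{\partial P(b_g|\tilde{B})}{\partial\beta} = \frac{-\rho(1 - \alpha)}{(1 - \beta + \rho\beta - 2\alpha\rho\beta)^2} \le 0, \tag{26}$$

implying that

$$\arg\min_{\beta} P(b_g) = \frac{1}{2}. \tag{27}$$

Substituting $\alpha = 1$ in $P(a_g)$ and $\alpha = 0$, $\beta = 1/2$ in $P(b_g)$, we get

$$\min P(a_g) = \frac{p}{1 + \rho} \quad \text{and} \quad \min P(b_g) = \frac{1 - p}{1 + \rho}. \tag{28}$$

Therefore, the worst case bound for (21) can be written as

$$\eta \le \sum_{k=0}^{r/2-1} \binom{s}{k} \left(\frac{p}{1 + \rho}\right)^k \left(1 - \frac{p}{1 + \rho}\right)^{s-k} + \sum_{k=0}^{r/2-1} \binom{s}{k} \left(\frac{1 - p}{1 + \rho}\right)^k \left(1 - \frac{1 - p}{1 + \rho}\right)^{s-k}. \tag{29}$$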



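Applying the Hoeffding bound Hoeffding (1963) individually on each of the two terms gives

$$\eta \le \exp\left(-\frac{2}{s}\left(\frac{sp}{1 + \rho} - \frac{r}{2} + 1\right)^2\right) + \exp\left(-\frac{2}{s}\left(\frac{s(1 - p)}{1 + \rho} - \frac{r}{2} + 1\right)^2\right), \tag{30}$$

as long as $\frac{sp}{1 + \rho} > \frac{r}{2} - 1$ and $\frac{s(1 - p)}{1 + \rho} > \frac{r}{2} - 1$. The RHS of (30) can be rewritten as

$$f(p) = \exp\left(-\frac{2s}{(1 + \rho)^2}\left(p - \frac{(r - 2)(1 + \rho)}{2s}\right)^2\right) + \exp\left(-\frac{2s}{(1 + \rho)^2}\left((1 - p) - \frac{(r - 2)(1 + \rho)}{2s}\right)^2\right), \tag{31}$$

which is simply the sum of two Gaussians with means $\mu_1 = \frac{(r - 2)(1 + \rho)}{2s}$ and $\mu_2 = 1 - \frac{(r - 2)(1 + \rho)}{2s}$
and equal variance $\sigma^2 = \frac{(1 + \rho)^2}{4s}$. Differentiating the above expression w.r.t. $p$,

$$\frac{df}{dp} = -\frac{4\left(\frac{sp}{1 + \rho} - \frac{r}{2} + 1\right)}{\exp\left(\frac{2}{s}\left(\frac{sp}{1 + \rho} - \frac{r}{2} + 1\right)^2\right)(1 + \rho)} + \frac{4\left(\frac{s(1 - p)}{1 + \rho} - \frac{r}{2} + 1\right)}{\exp\left(\frac{2}{s}\left(\frac{s(1 - p)}{1 + \rho} - \frac{r}{2} + 1\right)^2\right)(1 + \rho)}. \tag{32}$$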
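The Hoeffding step can be sanity-checked against the exact binomial tails of (29). The sketch below is illustrative only: the function names are hypothetical, $r$ is assumed to be even, and the bound is taken in the standard Hoeffding form $\exp(-2t^2/s)$ for a lower-tail deviation $t$ of a sum of $s$ Bernoulli trials.

```python
from math import comb, exp


def binom_lower_tail(s, q, k_max):
    """P(X <= k_max) for X ~ Binomial(s, q)."""
    return sum(comb(s, k) * q**k * (1 - q) ** (s - k) for k in range(k_max + 1))


def worst_case_eta(s, r, p, rho):
    """Exact value of the two binomial tails in (29); r is assumed even."""
    k_max = r // 2 - 1
    return (binom_lower_tail(s, p / (1 + rho), k_max)
            + binom_lower_tail(s, (1 - p) / (1 + rho), k_max))


def hoeffding_bound(s, r, p, rho):
    """Sum of the two Hoeffding lower-tail bounds, valid when both deviations are non-negative."""
    u = s * p / (1 + rho) - r / 2 + 1
    v = s * (1 - p) / (1 + rho) - r / 2 + 1
    if u < 0 or v < 0:
        raise ValueError("subset size s is too small for the Hoeffding bound to apply")
    return exp(-2 * u * u / s) + exp(-2 * v * v / s)
```

Evaluating both functions over a range of $p$ for fixed $s$, $r$ and $\rho$ shows how the bound behaves as the sampling moves away from the class-balanced case.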



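It can be clearly seen that $p = 0.5$ is a root of the derivative in (32). Also, the sum of two Gaussians
can be either unimodal ($p = 0.5$ is the global maximum) or bimodal ($p = 0.5$ is a minimum)
Behboodian (1970). The second order derivative of $f$ w.r.t. $p$ can be written as

$$\frac{d^2 f}{dp^2} = \frac{16\left(\frac{sp}{1 + \rho} - \frac{r}{2} + 1\right)^2 - 4s}{\exp\left(\frac{2}{s}\left(\frac{sp}{1 + \rho} - \frac{r}{2} + 1\right)^2\right)(1 + \rho)^2} + \frac{16\left(\frac{s(1 - p)}{1 + \rho} - \frac{r}{2} + 1\right)^2 - 4s}{\exp\left(\frac{2}{s}\left(\frac{s(1 - p)}{1 + \rho} - \frac{r}{2} + 1\right)^2\right)(1 + \rho)^2}. \tag{33}$$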



Therefore, enforcing a minimum at $p = 0.5$, we get the condition that

$$\left.\frac{d^2 f}{dp^2}\right|_{p=0.5} = \frac{32\left(\frac{s}{2(1 + \rho)} - \frac{r}{2} + 1\right)^2}{\exp\left(\frac{2}{s}\left(\frac{s}{2(1 + \rho)} - \frac{r}{2} + 1\right)^2\right)(1 + \rho)^2} - \frac{8s}{\exp\left(\frac{2}{s}\left(\frac{s}{2(1 + \rho)} - \frac{r}{2} + 1\right)^2\right)(1 + \rho)^2} > 0. \tag{34}$$

This directly implies that

$$s > (\rho + 1)\left[r - 2 + \frac{1}{2}(\rho + 1)\left(1 + \left(\frac{\rho + 4r - 7}{\rho + 1}\right)^{1/2}\right)\right], \tag{35}$$

which, as expected, is a stronger condition than the one required for imposing the Hoeffding
bound at $p = 0.5$, i.e., $s > (1 + \rho)(r - 2)$. Furthermore, to enforce $p = 0.5$ to be the global
minimum, we impose the condition that the value of $f$ at $p = 0.5$ is strictly less than that at
either of the two extreme points of $f$ (i.e., at $p = \frac{1}{s}(1 + \rho)(r/2 - 1)$ and $p = 1 - \frac{1}{s}(1 + \rho)(r/2 - 1)$).
This gives us an even stronger condition

$$s > (\rho + 1)\left[r - 2 + (\rho + 1)\left(\log 2 + \left(\frac{\rho\log^2 2 + \log^2 2 + 2r\log 2 - 4\log 2}{\rho + 1}\right)^{1/2}\right)\right]. \tag{36}$$

This is the sufficient condition to guarantee that the worst case probability of selecting
less than $r/2$ clean points per class is minimum at $p = 0.5$, i.e., when class-balanced sampling
is performed over the data.
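The three subset-size conditions can be compared numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it takes the Hoeffding form $\exp(-2u^2/s)$ for each term of $f$, writes the local-minimum condition as the positive root of $4u_0^2 > s$ and the global-minimum condition as the positive root of $2\exp(-2u_0^2/s) < 1$, where $u_0 = \frac{s}{2(1+\rho)} - \frac{r}{2} + 1$, and uses hypothetical function names throughout.

```python
from math import exp, log, sqrt


def hoeffding_threshold(r, rho):
    """Smallest s for which the Hoeffding bound applies at p = 0.5."""
    return (1 + rho) * (r - 2)


def local_min_threshold(r, rho):
    """Subset size above which p = 0.5 is a local minimum of f (condition (35))."""
    a = 1 + rho
    return a * (r - 2 + 0.5 * a * (1 + sqrt((rho + 4 * r - 7) / a)))


def global_min_threshold(r, rho):
    """Subset size above which f(0.5) < 1, a sufficient condition of the form (36)."""
    a, l2 = 1 + rho, log(2)
    return a * (r - 2 + a * (l2 + sqrt(l2 * l2 + 2 * l2 * (r - 2) / a)))


def f(p, s, r, rho):
    """Sum of the two Hoeffding terms as a function of the sampling probability p."""
    u = s * p / (1 + rho) - r / 2 + 1
    v = s * (1 - p) / (1 + rho) - r / 2 + 1
    return exp(-2 * u * u / s) + exp(-2 * v * v / s)


# Example: r = 6 clean points needed, noise level rho = 0.2.
r, rho = 6, 0.2
s = int(global_min_threshold(r, rho)) + 1   # smallest integer above the threshold
```

With these example values the three thresholds are ordered as the text describes, and evaluating `f` on a grid between the two Gaussian means places its smallest value at `p = 0.5`.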



Appendix C. Details of performance metrics 

For class-imbalanced data sets, very high classification accuracy can be trivially obtained
by labeling the entire data with the majority class label. The use of Balanced Accuracy
(BAC) for class-imbalanced data sets is prescribed by Brodersen et al. (2010) and can be
simply computed as

$$\mathrm{BAC} = \frac{\text{sensitivity} + \text{specificity}}{2}. \tag{37}$$

The sensitivity and specificity are defined as follows

$$\text{sensitivity} = \frac{tp}{tp + fn} \tag{38}$$

$$\text{specificity} = \frac{tn}{tn + fp} \tag{39}$$

where $tp$ and $fp$ denote the number of true and false positives while $tn$ and $fn$ denote the
number of true and false negatives.

Similarly, traditional F-score can be trivially maximized for imbalanced data sets by
compromising recall for high precision. Therefore, SIF Flach (2003) serves as an alternative
to the F-score for imbalanced data sets and is given by

$$\mathrm{SIF} = \frac{2\,tpr}{tpr + fpr + 1} \tag{40}$$

where $tpr$ and $fpr$ are the true and false positive rates, respectively. Just as BAC reduces to
accuracy, SIF reduces to the traditional F-score for class-balanced data sets. Another popular
metric for comparison of classification performances is Area Under the ROC Curve (AUC).
Although, unlike BAC and SIF, AUC is not a skew-insensitive measure, we also computed AUC
measures for all the methods. It is important to mention that the performance of SubSVMs is
always comparable to that of the other methods w.r.t. AUC. Finally, we note that for the
results reported using AUC, we needed to retrain an SVM on the error-corrected data (unlike
earlier, when we directly used majority voting on the test data).

[Earlier rows of Table 4, up to and including the CV-SVM row of the $\alpha = 0.5$ block, are not recoverable from the source.]

Noise           Method      a1a   a2a   a3a   a4a   a5a   splc  mush  svm1  w1a   w2a   w3a   w4a   w5a
$\alpha = 0.5$  SubSVMs     0.80  0.80  0.81  0.81  0.82  0.75  0.98  0.94  0.78  0.81  0.81  0.84  0.85
$\alpha = 0.0$  Oracle-SVM  0.22  0.22  0.23  0.22  0.22  0.20  0.37  0.01  0.16  0.17  0.19  0.18  0.18
$\alpha = 0.0$  Blind-SVM   0.22  0.22  0.23  0.22  0.22  0.13  0.37  0.01  0.14  0.17  0.18  0.18  0.18
$\alpha = 0.0$  Bag-SVM     0.14  0.15  0.15  0.16  0.16  0.15  0.31  0.01  0.12  0.14  0.15  0.14  0.15
$\alpha = 0.0$  CV-SVM      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.08  0.12  0.07  0.07  0.03
$\alpha = 0.0$  SubSVMs     0.78  0.78  0.79  0.79  0.81  0.77  0.97  0.96  0.69  0.75  0.80  0.80  0.81

Table 4: Skew-Insensitive F-score (SIF) under L-2 loss for clean and noisy versions of UCI
data sets. For different types of attacks (different $\alpha$) the results for each method
are averaged over 10 different noisy versions. 'splc', 'mush' and 'svm1' stand for
splice, mushrooms and svmguide1. Only CV-SVM and SubSVMs are agnostic to the
true test labels. The cases where one of the methods is significantly better than
all others (> 0.05) are highlighted.
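The two skew-insensitive measures defined in Appendix C, (37) to (40), can be computed directly from confusion-matrix counts. The following is a small illustrative Python sketch with hypothetical function names; it assumes both classes are non-empty so that no denominator is zero.

```python
def balanced_accuracy(tp, fp, tn, fn):
    """BAC: mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def skew_insensitive_f(tp, fp, tn, fn):
    """SIF: F-score expressed through true and false positive rates."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return 2 * tpr / (tpr + fpr + 1)


def f_score(tp, fp, fn):
    """Traditional F-score, for comparison on class-balanced data."""
    return 2 * tp / (2 * tp + fp + fn)


# Labeling everything as the majority (negative) class on a 10/90 split:
# accuracy is 0.9, but BAC is only 0.5.
majority_bac = balanced_accuracy(tp=0, fp=0, tn=90, fn=10)
```

On class-balanced counts such as `tp=40, fn=10, tn=45, fp=5`, `skew_insensitive_f` and `f_score` agree, and `balanced_accuracy` equals plain accuracy.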



Appendix D. Additional Results 

Tables 4, 5 and 6 present additional results on the UCI data sets under L-2 loss using
Skew-Insensitive F-Score (SIF), Area Under the Curve (AUC) and Accuracy, respectively.
Table 7 shows detailed run-times corresponding to Table 3 in the paper.

Noise           Method      a1a   a2a   a3a   a4a   a5a   splc  mush  svm1  w1a   w2a   w3a   w4a   w5a
clean           Oracle-SVM  0.89  0.90  0.90  0.90  0.90  0.96  1.00  1.00  0.93  0.95  0.96  0.96  0.96
clean           Blind-SVM   0.89  0.90  0.90  0.90  0.90  0.95  1.00  0.99  0.93  0.95  0.95  0.96  0.96
clean           Bag-SVM     0.88  0.89  0.89  0.89  0.89  0.96  1.00  1.00  0.80  0.88  0.90  0.91  0.94
clean           CV-SVM      0.89  0.90  0.90  0.90  0.90  0.96  1.00  1.00  0.91  0.95  0.96  0.96  0.96
clean           SubSVMs     0.89  0.89  0.89  0.90  0.90  0.93  0.98  0.99  0.90  0.92  0.92  0.93  0.94
$\alpha = 1.0$  Oracle-SVM  0.88  0.89  0.89  0.89  0.89  0.88  1.00  0.99  0.92  0.93  0.94  0.94  0.95
$\alpha = 1.0$  Blind-SVM   0.88  0.89  0.89  0.89  0.89  0.85  1.00  0.99  0.91  0.93  0.94  0.94  0.94
$\alpha = 1.0$  Bag-SVM     0.87  0.88  0.89  0.89  0.89  0.88  0.56  0.99  0.98  0.92  0.88  0.92  0.90
$\alpha = 1.0$  CV-SVM      0.86  0.87  0.87  0.88  0.88  0.88  0.99  0.99  0.89  0.92  0.94  0.94  0.94
$\alpha = 1.0$  SubSVMs     0.88  0.88  0.89  0.89  0.89  0.88  0.98  0.99  0.86  0.90  0.91  0.92  0.93
$\alpha = 0.5$  Oracle-SVM  0.88  0.88  0.88  0.88  0.89  0.82  1.00  0.99  0.91  0.92  0.93  0.94  0.94
$\alpha = 0.5$  Blind-SVM   0.88  0.88  0.88  0.88  0.89  0.82  1.00  0.96  0.91  0.92  0.92  0.93  0.94
$\alpha = 0.5$  Bag-SVM     0.82  0.84  0.85  0.86  0.86  0.85  1.00  0.99  0.98  0.97  0.89  0.89  0.86
$\alpha = 0.5$  CV-SVM      0.84  0.84  0.86  0.86  0.88  0.78  0.97  0.99  0.89  0.91  0.92  0.93  0.93
$\alpha = 0.5$  SubSVMs     0.87  0.87  0.88  0.88  0.89  0.82  0.99  0.99  0.86  0.88  0.89  0.91  0.92
$\alpha = 0.0$  Oracle-SVM  0.87  0.88  0.88  0.89  0.89  0.87  1.00  0.99  0.89  0.90  0.92  0.93  0.93
$\alpha = 0.0$  Blind-SVM   0.87  0.88  0.88  0.89  0.89  0.87  1.00  0.98  0.88  0.90  0.92  0.93  0.93
$\alpha = 0.0$  Bag-SVM     0.50  0.50  0.50  0.50  0.50  0.53  0.08  0.94  0.98  0.98  0.98  0.98  0.98
$\alpha = 0.0$  CV-SVM      0.86  0.86  0.88  0.88  0.89  0.86  1.00  0.99  0.87  0.88  0.91  0.92  0.92
$\alpha = 0.0$  SubSVMs     0.87  0.88  0.88  0.89  0.89  0.88  0.99  0.99  0.83  0.86  0.88  0.90  0.91

Table 5: Area Under the Curve (AUC) under L-2 loss for clean and noisy versions of UCI
data sets. For different types of attacks (different $\alpha$) the results for each method
are averaged over 10 different noisy versions. 'splc', 'mush' and 'svm1' stand for
splice, mushrooms and svmguide1. Only CV-SVM and SubSVMs are agnostic to the
true test labels. The cases where one of the methods is significantly better than
all others (> 0.05) are highlighted.

Noise           Method      a1a   a2a   a3a   a4a   a5a   splc  mush  svm1  w1a   w2a   w3a   w4a   w5a
clean           Oracle-SVM  0.84  0.85  0.85  0.85  0.85  0.91  1.00  0.97  0.98  0.98  0.98  0.98  0.99
clean           Blind-SVM   0.84  0.84  0.84  0.85  0.85  0.90  1.00  0.96  0.98  0.98  0.98  0.98  0.98
clean           Bag-SVM     0.84  0.85  0.85  0.85  0.85  0.90  1.00  0.97  0.98  0.98  0.98  0.98  0.99
clean           CV-SVM      0.84  0.84  0.85  0.85  0.85  0.91  1.00  0.97  0.98  0.98  0.98  0.98  0.99
clean           SubSVMs     0.78  0.78  0.78  0.79  0.79  0.86  0.98  0.96  0.84  0.84  0.83  0.85  0.86
$\alpha = 1.0$  Oracle-SVM  0.80  0.79  0.81  0.81  0.81  0.56  0.63  0.89  0.98  0.98  0.98  0.98  0.99
$\alpha = 1.0$  Blind-SVM   0.79  0.78  0.80  0.81  0.81  0.51  0.49  0.82  0.98  0.98  0.98  0.98  0.98
$\alpha = 1.0$  Bag-SVM     0.80  0.79  0.81  0.81  0.81  0.53  0.61  0.89  0.98  0.98  0.98  0.98  0.99
$\alpha = 1.0$  CV-SVM      0.77  0.77  0.78  0.80  0.79  0.48  0.49  0.89  0.98  0.97  0.98  0.98  0.98
$\alpha = 1.0$  SubSVMs     0.73  0.73  0.74  0.74  0.74  0.77  0.97  0.90  0.78  0.76  0.77  0.82  0.81
$\alpha = 0.5$  Oracle-SVM  0.80  0.80  0.81  0.81  0.81  0.74  0.99  0.90  0.98  0.98  0.98  0.98  0.98
$\alpha = 0.5$  Blind-SVM   0.80  0.80  0.81  0.81  0.81  0.70  0.95  0.85  0.97  0.97  0.97  0.97  0.97
$\alpha = 0.5$  Bag-SVM     0.80  0.80  0.81  0.81  0.81  0.74  0.99  0.90  0.98  0.98  0.98  0.98  0.98
$\alpha = 0.5$  CV-SVM      0.79  0.80  0.80  0.80  0.80  0.71  0.95  0.90  0.97  0.97  0.98  0.98  0.98
$\alpha = 0.5$  SubSVMs     0.75  0.75  0.75  0.76  0.76  0.74  0.97  0.93  0.75  0.75  0.81  0.82  0.83
$\alpha = 0.0$  Oracle-SVM  0.77  0.77  0.77  0.77  0.77  0.57  0.63  0.50  0.97  0.97  0.97  0.97  0.97
$\alpha = 0.0$  Blind-SVM   0.77  0.77  0.77  0.77  0.77  0.55  0.63  0.50  0.97  0.97  0.97  0.97  0.97
$\alpha = 0.0$  Bag-SVM     0.77  0.77  0.77  0.77  0.77  0.56  0.61  0.50  0.97  0.97  0.97  0.97  0.97
$\alpha = 0.0$  CV-SVM      0.76  0.76  0.76  0.76  0.76  0.52  0.52  0.50  0.97  0.97  0.97  0.97  0.97
$\alpha = 0.0$  SubSVMs     0.80  0.79  0.79  0.80  0.80  0.79  0.97  0.96  0.86  0.84  0.82  0.87  0.93

Table 6: Accuracy under L-2 loss for clean and noisy versions of UCI data sets. For different
types of attacks (different $\alpha$) the results for each method are averaged over 10
different noisy versions. 'splc', 'mush' and 'svm1' stand for splice, mushrooms and
svmguide1. Only CV-SVM and SubSVMs are agnostic to the true test labels. The
cases where one of the methods is significantly better than all others (> 0.05) are
highlighted.

Noise           Method      a1a   a2a   a3a   a4a   a5a   splc  mush  svm1  w1a   w2a   w3a   w4a   w5a
clean           Oracle-SVM  180   266   376   601   937   38    102   19    213   286   372   532   805
clean           Blind-SVM   175   258   360   588   917   38    81    17    207   269   347   483   746
clean           Bag-SVM     64    134   276   617   1675  64    167   38    26    51    103   215   369
clean           CV-SVM      138   286   579   1465  2860  190   1033  102   82    160   293   654   1146
clean           SubSVMs     3     3     3     4     4     4     7     3     3     3     3     4     5
$\alpha = 1.0$  Oracle-SVM  291   418   606   957   2211  47    372   61    387   568   840   1521  2528
$\alpha = 1.0$  Blind-SVM   292   420   610   960   2213  47    370   59    383   568   840   1520  2527
$\alpha = 1.0$  Bag-SVM     73    147   300   703   3272  69    1179  358   59    129   224   710   1951
$\alpha = 1.0$  CV-SVM      242   499   1034  2616  5036  214   3143  532   157   329   678   1633  3029
$\alpha = 1.0$  SubSVMs     4     5     5     6     7     5     8     4     4     4     5     6     7
$\alpha = 0.5$  Oracle-SVM  281   408   588   931   1980  55    1513  67    304   460   639   1144  1905
$\alpha = 0.5$  Blind-SVM   280   408   590   938   1980  55    1507  67    304   459   640   1144  1908
$\alpha = 0.5$  Bag-SVM     100   256   518   860   3859  72    3937  330   39    76    146   365   828
$\alpha = 0.5$  CV-SVM      224   465   936   2364  4598  262   6597  659   108   229   443   1095  2070
$\alpha = 0.5$  SubSVMs     4     5     5     6     7     5     8     4     4     4     5     6     7
$\alpha = 0.0$  Oracle-SVM  118   180   256   393   598   31    257   28    120   181   237   347   495
$\alpha = 0.0$  Blind-SVM   118   171   240   385   571   31    273   22    123   176   229   329   482
$\alpha = 0.0$  Bag-SVM     21    42    80    197   342   46    899   105   14    29    54    112   206
$\alpha = 0.0$  CV-SVM      75    154   311   771   1510  141   2332  160   34    65    123   278   507
$\alpha = 0.0$  SubSVMs     3     3     4     4     5     5     7     3     3     7     3     4     5

Table 7: Training times in seconds (rounded to the closest integer) for all the methods
trained using L-2 loss, averaged over 10 random label-manipulated versions of the
data sets, corresponding to the results presented in Table 1 of the paper. 'splc',
'mush' and 'svm1' stand for splice, mushrooms and svmguide1. Note that for
SubSVMs, the reported time is the time taken to train all the J = 1000 SVMs on
$s = \log_2 \ell$-size subsets.
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